fix(scheduler) Make schedule changes take effect immediately#590
fix(scheduler) Make schedule changes take effect immediately#590
Conversation
We had feedback a few months back that changes in schedules should take effect immediately, and these changes implement that with backwards compatibility for existing run state. The scenario that came up was a task was moved from every 3 hours to every 10 minutes, and the developer was surprised when their task had to wait for 2+hrs to see their task run again. Refs STREAM-676
| legacy_keys = [sk.rsplit(":", 1)[0] for sk in storage_keys] | ||
|
|
||
| new_values = self._redis.mget([self._make_key(sk) for sk in storage_keys]) | ||
| legacy_values = self._redis.mget([self._make_key(lk) for lk in legacy_keys]) | ||
|
|
||
| run_times: dict[str, datetime | None] = {} | ||
| for storage_key, new_val, legacy_val in zip(storage_keys, new_values, legacy_values): | ||
| raw = new_val if new_val is not None else legacy_val | ||
| run_times[storage_key] = datetime.fromisoformat(raw) if raw else None |
There was a problem hiding this comment.
Bug: The backwards-compatibility fallback for legacy run state can prevent a task from running immediately after a schedule change, delaying its execution until the new interval has passed.
Severity: MEDIUM
Suggested Fix
The backwards-compatibility logic should be adjusted to ignore the legacy last_run value if the schedule has changed. One approach is to have read_many not perform the fallback at all, forcing an immediate run on the new schedule. Alternatively, the calling function _load_last_run could compare the schedule ID from the new key with the schedule that produced the legacy data (if possible to determine) and ignore the legacy data if they differ.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: clients/python/src/taskbroker_client/scheduler/runner.py#L83-L91
Potential issue: The backwards-compatibility fallback in `read_many()` can prevent a
task from running immediately after a schedule change. If a task ran recently under an
old schedule, its last run time is stored in a legacy-formatted Redis key. When the code
is updated and the schedule is changed simultaneously, the new logic will read the last
run time from the legacy key. This causes the `is_due()` calculation to incorrectly
delay the task's next execution, respecting the old run time instead of triggering
immediately, which contradicts the feature's goal.
Did we get this right? 👍 / 👎 to inform future reviews.
| Retrieve last run times in bulk. | ||
|
|
||
| storage_keys are the new-format keys including the schedule_id suffix | ||
| (e.g. "test:valid:300"). Falls back to the legacy key (derived by |
There was a problem hiding this comment.
This is a contrived example, but:
storage_keys = [`test:val:300`, `test:val:3000`]
legacy_keys = [`test:val`, `test:val`]
new_values = [123, None] # No value for 3000
legacy_values = [321, 321] # We have a legacy value
run_times[`test:val:300`] = 123
run_times[`test:val:3000`] = 321
Is this scenario possible? And if so, does this result make sense?
There was a problem hiding this comment.
Is this scenario possible? And if so, does this result make sense?
If I understand correctly, a task has both a legacy + new key (test:val and test:val:300) and then the schedule is changed from 300 -> 3000? In that scenario we shouldn't have both test:val:300 and test:val:3000 in storage_keys because test:val is the scheduled task name, and task names would be unique.
There was a problem hiding this comment.
But isn't it possible to create such a scenario anyways? For example, if the scheduler command has...
scheduler.add(
"simple-task", {"task": "test:val", "schedule": timedelta(seconds=300)}
)
...
scheduler.add(
"simple-task", {"task": "test:val", "schedule": timedelta(seconds=3000)}
)... wouldn't this recreate the example Evan proposed?
There was a problem hiding this comment.
Yeah that scenario would trigger the problem. Currently having two schedule entries with the same task and different schedules results in only one of the schedules being followed because the storage key is shared.
We currently prevent tasks from having two different schedules in sentry with a test
So the problematic scenario has been prevented for now. As we on-board more applications though we should make this more resilient to operator error. Using the schedule name as the storage key would help with that. I can follow up these changes with that improvement.
In #590 I made schedule storage keys embed their schedule data, and another issue was identified that a task with two schedules could have an overlapping storage key. These changes embed the schedule entry name into the storage key to ensure that schedule timing is unique for each entry. Now changing, the schedule name, task name, or schedule timing will reset the schedule timing. Refs STREAM-676
In #590 I made schedule storage keys embed their schedule data, and another issue was identified that a task with two schedules could have an overlapping storage key. These changes embed the schedule entry name into the storage key to ensure that schedule timing is unique for each entry. Now changing, the schedule name, task name, or schedule timing will reset the schedule timing. Refs STREAM-676 * Fix compiler warning from new rust
We had feedback a few months back that changes in schedules should take effect immediately, and these changes implement that with backwards compatibility for existing run state.
The scenario that came up was a task was moved from every 3 hours to every 10 minutes, and the developer was surprised when their task had to wait for 2+hrs to see their task run again.
Refs STREAM-676